Abstract

In this paper we tackle the problem of estimating the power-law tail exponent
of income distributions by using Hill's estimator. A subsample semi-parametric
bootstrap procedure minimising the mean squared error is used to choose the
power-law cutoff value optimally. This technique is applied to personal income
data for Australia and Italy.

Key words: Personal income, Pareto's index, Hill's estimator, bootstrap 
PACS: 02.50.Ng, 02.50.Tt, 02.60.Ed, 89.65.Gh 



1 Introduction 



Since Pareto it has been recognized that a power law provides a good fit for
the distribution of high incomes [1]. Pareto's law asserts that the
complementary cumulative distribution
$P_{>}(y) = 1 - \int_{-\infty}^{y} p(\xi)\, d\xi \rightarrow P_{>}(u) \left( \frac{u}{y} \right)^{\alpha}$,
with $y \geq u$, where $u > 0$ is the threshold value of the distribution and
$\alpha > 0$ serves as a kind of index of inequality of the distribution. Such
a distribution is usually fitted by judging the degree of linearity in a double
logarithmic plot of the empirical and theoretical distribution functions, so
that the estimation of $u$ does not seem to follow a neutral procedure.
Moreover, recent studies have criticized the reliability of this geometrical
method by showing that linear-fit based methods for estimating the power-law
exponent tend to provide biased estimates, while the maximum likelihood
estimation method produces more accurate and robust estimates [2,3]. Hill
proposed a conditional maximum likelihood estimator for $\alpha$ based on the
$k$ largest order statistics for non-negative data with a Pareto tail [4].
That is, if $y_{[n]} \geq y_{[n-1]} \geq \ldots \geq y_{[n-k]} \geq \ldots \geq y_{[1]}$,
with $y_{[i]}$ denoting the $i$th order statistic, are the sample elements put
in descending order, then Hill's estimator is

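$$\hat{\alpha}_n(k) = \left[ \frac{1}{k} \sum_{i=1}^{k} \left( \log y_{[n-i+1]} - \log y_{[n-k]} \right) \right]^{-1}, \qquad (1)$$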

where $n$ is the sample size and $k$ an integer value in $[1, n]$. Unfortunately,
the finite-sample properties of the estimator (Eq. 1) depend crucially on the 
choice of k: increasing k reduces the variance because more data are used, 
but it increases the bias because the power-law is assumed to hold only in the 
extreme tail. 
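
As an illustration, Eq. 1 can be sketched in a few lines of Python. The function name `hill_estimator` is hypothetical, and the sketch assumes strictly positive data and requires k < n so that the threshold observation below the k largest values exists.

```python
import math


def hill_estimator(sample, k):
    """Hill's estimate of the tail exponent from the k largest observations.

    Assumption of this sketch: all values are strictly positive and k < n,
    so that the (k+1)-th largest value can serve as the threshold.
    """
    n = len(sample)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < n")
    ordered = sorted(sample, reverse=True)  # ordered[0] is the largest value
    threshold = ordered[k]                  # (k+1)-th largest observation
    # Mean log-excess of the k largest values over the threshold.
    mean_log_excess = sum(math.log(v / threshold) for v in ordered[:k]) / k
    return 1.0 / mean_log_excess
```

Evaluating the function over a range of k values on the same sample makes the trade-off described above visible: estimates for small k fluctuate strongly, while estimates for large k drift once observations outside the power-law region enter the sum.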

Over the last twenty years, estimation of Pareto's index has received
considerable attention in extreme value statistics [5]. All of the proposed
estimators, including Hill's estimator, are based on the assumption that the
number of observations in the upper tail to be included, $k$, is known. In
practice, $k$ is unknown; therefore, the first task is to identify which values
are really extreme values. Tools from exploratory data analysis, such as the
quantile-quantile plot and/or the mean excess plot, might prove helpful in
detecting graphically the quantile $y_{[n-k]}$ above which Pareto's
relationship is valid; however, they do not provide any formal computable
method and, by imposing an arbitrary threshold, they give only very rough
estimates of the range of extreme values.

Given the bias-variance trade-off for Hill's estimator, a general and formal
approach to determining the best $k$ value is the minimisation of the Mean
Squared Error (MSE) between $\hat{\alpha}_n(k)$ and the theoretical value
$\alpha$. Unfortunately, in empirical studies the theoretical value of $\alpha$
is not known. Therefore, an approximation to the sampling distribution of
Hill's estimator is required. To this end, a number of innovative techniques in
the statistical analysis of extreme values propose adopting the powerful
bootstrap tool to find the optimal number of order statistics adaptively
[6,7,8,9]. By capitalizing on these recent advances in the extreme value
statistics literature, in this paper we adopt a subsample semi-parametric
bootstrap algorithm in order to make a reasonable and more automated selection
of the extreme quantiles useful for studying the upper tail of income
distributions and to arrive at less ambiguous estimates of $\alpha$. This
methodology is described in Section 2 and its application to Australian and
Italian income data [10,11] is given in Section 3. Some concluding remarks are
reported in Section 4.



2 Estimation Technique for Threshold Selection 



In this section we consider the problem of finding the optimal threshold
$u_n^*$ (or, equivalently, the optimal number $k^*$ of extreme sample values
above that threshold) to be used for estimation of $\alpha$. In order to
achieve this task, we minimize the MSE of Hill's estimator (Eq. 1) for a series
of thresholds $u_n = y_{[n-k]}$, and pick the $u_n$ value at which the MSE
attains its minimum as $u_n^*$. Given that different threshold series choices
define different sets of possible observations to be included in the upper tail
of a specific observed sample $y_n = \{y_i;\ i = 1, 2, \ldots, n\}$, only the
observations exceeding a certain threshold that are additionally distributed
according to a Pareto cumulative distribution function
$PD_{\hat{\alpha}_n(k),u_n}(y)$ are included in the series. In order to check
this condition, we perform for each threshold in the original sample a
Kolmogorov-Smirnov (K-S) goodness-of-fit test for the null hypothesis
$H_0 : F_n(y) = PD_{\hat{\alpha}_n(k),u_n}(y)$ versus the general alternative
of the form $H_1 : F_n(y) \neq PD_{\hat{\alpha}_n(k),u_n}(y)$, where $F_n(y)$
is the empirical distribution function, and $\hat{\alpha}_n(k)$ is a prior
estimate for each threshold $u_n$ of Pareto's tail index obtained through
Hill's statistic. Following the methodology in [12], the formal steps in making
a test of $H_0$ are as follows:

(a) Calculate the original K-S test statistic $D$ by using the formula
$D = \sup_{-\infty < y < \infty} \left| F_n(y) - PD_{\hat{\alpha}_n(k),u_n}(y) \right|$.

(b) Calculate the modified form $T^*$ by using the formula

$$T^* = D \left( \sqrt{n} + 0.12 + \frac{0.11}{\sqrt{n}} \right). \qquad (2)$$

(c) Reject $H_0$ if $T^*$ exceeds the cutoff level, $z$, for the chosen
significance level.
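
Steps (a) to (c) can be sketched as follows. This is a minimal illustration under two assumptions that the text does not spell out: the test is applied to the exceedances over a threshold u, whose conditional distribution under the Pareto model is 1 - (u/y)^alpha, and the n entering Eq. 2 is taken to be the number of exceedances. The names `modified_ks_statistic` and `Z_5_PERCENT` are hypothetical.

```python
import math

Z_5_PERCENT = 1.358  # 5% cutoff level z quoted in the text


def modified_ks_statistic(exceedances, u, alpha):
    """Return (D, T*) for exceedances over u against a Pareto tail.

    Assumptions of this sketch: every value is >= u, the model CDF is the
    conditional one, 1 - (u / y)**alpha, and the n of Eq. 2 is the number
    of exceedances.
    """
    m = len(exceedances)
    ordered = sorted(exceedances)
    d = 0.0
    for i, value in enumerate(ordered):
        cdf = 1.0 - (u / value) ** alpha
        # The empirical CDF jumps from i/m to (i+1)/m at `value`,
        # so the supremum is checked on both sides of the jump.
        d = max(d, (i + 1) / m - cdf, cdf - i / m)
    t_star = d * (math.sqrt(m) + 0.12 + 0.11 / math.sqrt(m))
    return d, t_star
```

A threshold would then be retained when the returned T* does not exceed `Z_5_PERCENT`, and discarded otherwise.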



To obtain an estimate of finite-sample bias and variance (and thus MSE) at
each threshold coming from the null hypothesis $H_0$, a natural criterion is to
use the bootstrap [13]. In its purest form, the bootstrap involves
approximating an unknown distribution function, $F(y)$, by the empirical
distribution function, $F_n(y)$. However, the empirical distribution from which
one resamples in a purely non-parametric bootstrap is often not a good
approximation of the distribution shape in the tail. Therefore, we initially
smooth the tail data by fitting a Pareto cumulative distribution function

$$PD_{\hat{\alpha}_n(k),u_n}(y) = p = 1 - P_{>}(u_n) \left( \frac{u_n}{y} \right)^{\hat{\alpha}_n(k)} \qquad (3)$$

to the $n_1 < n$ observations $y_{n_1} = \{y \in y_n : T^* < z\}$, and then use
the quantiles $y_p = \{y \in y_{n_1} : PD_{\hat{\alpha}_n(k),u_n}(y) > p\}$
obtained directly from inverting the estimated model (Eq. 3) to draw the
bootstrap samples.
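
Inverting Eq. 3 means solving p = 1 - P_>(u)(u/y)^alpha for y, which gives y = u [P_>(u) / (1 - p)]^(1/alpha) for probability levels p at or above 1 - P_>(u). The sketch below illustrates this inversion; the function names and the argument `tail_prob` (standing for P_>(u)) are hypothetical.

```python
def pareto_cdf(y, u, alpha, tail_prob):
    """Model of Eq. 3: 1 - P_>(u) * (u / y)**alpha, valid for y >= u."""
    return 1.0 - tail_prob * (u / y) ** alpha


def pareto_quantile(p, u, alpha, tail_prob):
    """Quantile at level p obtained by inverting the model of Eq. 3.

    Only levels inside the tail, 1 - tail_prob <= p < 1, are meaningful,
    because the model describes the distribution above the threshold u.
    """
    if not (1.0 - tail_prob) <= p < 1.0:
        raise ValueError("p must satisfy 1 - tail_prob <= p < 1")
    return u * (tail_prob / (1.0 - p)) ** (1.0 / alpha)
```

Evaluating `pareto_quantile` on a grid of probability levels yields a smoothed set of tail values from which bootstrap samples can be drawn with replacement; the choice of grid is left open here because the text does not specify it.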

Let us here summarize the adopted methodology: 

(1) Evaluate the estimate $\hat{\alpha}_n(k)$ of Pareto's tail index for each
threshold in the original sample $y_n$ by using Hill's estimator (Eq. 1).

(2) For each threshold in the original sample, test the Pareto approximation
by computing the value of the K-S test statistic (Eq. 2).

(3) Fit the model (Eq. 3) to the subset of data $y_{n_1}$ not rejected under
the null hypothesis $H_0$.

(4) Select $R$ independent bootstrap samples $y_1^*, y_2^*, \ldots, y_R^*$,
each consisting of $n_1$ values drawn with replacement from the set of
quantiles $y_p$ obtained by inverting the fitted model (Eq. 3).

(5) For each bootstrap sample $y_r^*$, $r = 1, 2, \ldots, R$, and for each
threshold $u_{n_1}$ in the bootstrap sample, evaluate the bootstrap estimate
$\hat{\alpha}^*_{n_1}(k_1)$ of Pareto's tail index by using Hill's estimator
(Eq. 1).

(6) For each threshold $u_{n_1}$, calculate the bias,
$B = E[\hat{\alpha}^*_{n_1}(k_1)] - \hat{\alpha}_n(k)$, the variance,
$Var = E\{[\hat{\alpha}^*_{n_1}(k_1)]^2\} - \{E[\hat{\alpha}^*_{n_1}(k_1)]\}^2$,
and the mean squared error, $MSE = B^2 + Var$, of Hill's tail index estimates.

(7) Select as the optimal threshold $u_n^* = y_{[n-k^*]}$ that threshold where
the MSE attains its minimum.



Minimising the MSE thus amounts to finding the MSE-minimising number of order
statistics $k^* = \arg\min \mathrm{MSE}$, from which one infers the optimal
estimate of the tail index $\hat{\alpha}^*_n(k^*)$.
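
The resampling part of the procedure (steps 4 to 7) can be sketched as below. This is a simplified illustration, not the authors' implementation: the pool of smoothed tail values is taken as given, the bias is measured against a single reference value `alpha_ref` rather than a threshold-specific prior estimate, and all function and parameter names are hypothetical.

```python
import math
import random


def hill_from_sorted(desc, k):
    """Hill's estimate from values already sorted in descending order.

    Assumes the k largest values are not all equal to the threshold desc[k].
    """
    threshold = desc[k]
    return k / sum(math.log(v / threshold) for v in desc[:k])


def select_k_by_bootstrap_mse(pool, alpha_ref, candidate_ks, n_boot=200, seed=0):
    """Return (k_star, mse_by_k) from a bootstrap over the pool of tail values.

    pool         -- smoothed tail values to resample from (assumed given)
    alpha_ref    -- reference tail index for the bias (a simplifying assumption)
    candidate_ks -- numbers of upper order statistics to compare
    """
    rng = random.Random(seed)
    n1 = len(pool)
    estimates = {k: [] for k in candidate_ks}
    for _ in range(n_boot):
        # One bootstrap sample of n1 values drawn with replacement.
        boot = sorted(rng.choices(pool, k=n1), reverse=True)
        for k in candidate_ks:
            estimates[k].append(hill_from_sorted(boot, k))
    mse_by_k = {}
    for k, values in estimates.items():
        mean = sum(values) / len(values)
        bias = mean - alpha_ref
        variance = sum((v - mean) ** 2 for v in values) / len(values)
        mse_by_k[k] = bias ** 2 + variance
    k_star = min(mse_by_k, key=mse_by_k.get)
    return k_star, mse_by_k
```

Sorting each bootstrap sample once and reusing it for every candidate k keeps the cost at one sort per replication. The optimal threshold then follows from `k_star` as the corresponding order statistic of the original sample.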



3 Empirical Application: The Australian and Italian Personal Income Distributions



The data we use to illustrate how the methodology proposed in Section 2 can be
applied to the analysis of income distributions have been selected from the
nationally representative cross-sectional data samples of the Australian and
Italian household populations. In particular, we have analyzed the total
annual income from all sources in the years 1993-94 to 1996-97, and then in
1989-90, 1998-99, 1999-2000, and 2001-02 for Australia, and 1977-2002 for Italy
[10,11,14]. Here we report only the results in the year 1999-2000 for Australia
and 2000 for Italy.

Fig. 1. Modified K-S statistic (Eq. 2) as a function of the tail size for (a)
Australia in 1999-2000 and (b) Italy in 2000. [Horizontal axis: Tail size (%).
Legend: 5% percentage point = 1.358; points with $T^* > 1.358$; points with
$T^* < 1.358$.]

Figs. 1 (a) and (b) depict the outcomes of the complete sequences of K-S tests
for a selection of tail fractions. Blue points (see the online version) mark
all the observations for which the modified K-S statistic (Eq. 2) does not
exceed the 5% cutoff level $z = 1.358$ (solid lines in the figures). The 5%
significance point $z = 1.358$ comes from Table 1A in [12]. The figures
indicate the tail regions that may be tentatively regarded as appropriate for
the implementation of the semi-parametric bootstrap technique.



Hill's estimator (Eq. 1) is reported in Fig. 2 for Australia (a) and Italy
(b), for tails up to 20% and 25% of the full sample size respectively (see
solid lines). In these figures, the optimal numbers of extreme sample values
are marked, namely $k^* = 299$ for Australia and $k^* = 3222$ for Italy,
providing the following values for the tail power-law exponents:
$\hat{\alpha}^*_n(k^*) = 2.3 \pm 0.2$ and $\hat{\alpha}^*_n(k^*) = 2.5 \pm 0.1$,
respectively, where the errors (with 95% confidence) have been obtained through
the jackknife method [15]. In these computations, we have used 1000 resamples
and the subsample size has been set equal to the number of observations not
rejected by the K-S test at the 5% level (see Section 2 and Figs. 1 (a) and
(b)). Repeated calculations with a different number of replications produce a
spread of tail index estimates with deviations inside the 95% uncertainty band
(dashed lines in the figures), thereby showing the numerical robustness of our
results. We have here obtained more precise values of the power-law tail
exponents than those previously reported in the literature [11].
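
The text does not detail how the jackknife was applied, so the following is only a generic delete-one jackknife sketch for the standard error of a Hill estimate. It assumes the threshold u is held fixed while each exceedance is left out in turn; the function name `hill_jackknife` is hypothetical.

```python
import math


def hill_jackknife(exceedances, u):
    """Return (estimate, jackknife standard error) for Hill's estimator.

    Assumptions of this sketch: `exceedances` holds the k values above a
    fixed threshold u, and one exceedance is deleted at a time.
    """
    logs = [math.log(y / u) for y in exceedances]
    k = len(logs)
    total = sum(logs)
    estimate = k / total
    # Leave-one-out estimates: drop one log-excess, keep the threshold.
    leave_one_out = [(k - 1) / (total - x) for x in logs]
    mean_loo = sum(leave_one_out) / k
    # Standard delete-one jackknife variance formula.
    variance = (k - 1) / k * sum((t - mean_loo) ** 2 for t in leave_one_out)
    return estimate, math.sqrt(variance)
```

Under a normal approximation, which is a further assumption, an approximate 95% band would be the estimate plus or minus 1.96 times the returned standard error.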

Fig. 2. Hill's estimator (Eq. 1) for (a) Australia in 1999-2000 and (b) Italy
in 2000. The dashed lines represent the 95% confidence limits of the tail index
estimates computed by using the jackknife method. The arrows mark the optimal
number of extreme sample values $k^*$.

The use of these optimal $\hat{\alpha}^*$ values produces the fits shown by the
solid lines in Figs. 3 (a) and (b) for Australia and Italy, where the
complementary cumulative distributions are plotted on a log-log scale. The
vertical dashed lines indicate the optimal values of the threshold parameter
attained by subsample semi-parametric bootstrapping: (a) $u_n^* = \$82367$ for
Australia in 1999-2000 and (b) $u_n^* = \text{€}19655$ for Italy in 2000. As we
can see, our procedure succeeds in avoiding deviations from linearity for the
largest observations that might strongly influence the estimation of $\alpha$,
illustrating therefore the importance of optimally choosing the tail threshold.
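
To make the log-log comparison concrete, the sketch below computes an empirical complementary cumulative distribution and the fitted power-law tail P_>(u)(u/y)^alpha, whose slope on a log-log scale is -alpha. The function names and the argument `tail_prob` (standing for P_>(u)) are hypothetical, and the empirical function assumes a sample without ties.

```python
import math


def empirical_ccdf(sample):
    """Return (y, P_>(y)) pairs, with P_>(y) the fraction of values above y.

    Assumes no ties, so exactly n - i - 1 values lie above the i-th smallest.
    """
    ordered = sorted(sample)
    n = len(ordered)
    return [(y, (n - i - 1) / n) for i, y in enumerate(ordered)]


def pareto_tail_ccdf(y, u, alpha, tail_prob):
    """Fitted tail P_>(u) * (u / y)**alpha, valid for y >= u."""
    return tail_prob * (u / y) ** alpha


def loglog_slope(y1, y2, u, alpha, tail_prob):
    """Slope of the fitted tail between y1 and y2 on a log-log scale."""
    p1 = pareto_tail_ccdf(y1, u, alpha, tail_prob)
    p2 = pareto_tail_ccdf(y2, u, alpha, tail_prob)
    return (math.log(p2) - math.log(p1)) / (math.log(y2) - math.log(y1))
```

Plotting the pairs from `empirical_ccdf` together with `pareto_tail_ccdf` evaluated above the selected threshold gives the kind of comparison described above; the largest observation has an empirical value of zero and has to be dropped before taking logarithms.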



4 Concluding Remarks 

In this paper we have considered the problem of estimating the power-law tail
exponent of income distributions and we have adopted a subsample
semi-parametric bootstrap procedure in order to arrive at less ambiguous
estimates of $\alpha$. This methodology has been applied empirically to
personal income distribution data for Australia and Italy. The reliability and
robustness of the results have been tested by running repeated bootstrap
replications and comparing the variability of the estimates through a jackknife
method.

From the economic point of view, this technique for the estimation of the 
Pareto's tail index of income distribution is expected to allow a deeper under- 
standing of both the way in which cyclical fluctuations in economic activity 
affect factor income shares and the channels through which these effects work 
through the size distribution of income, which are issues of relevance for the 
modeling of the income process in the high-end tail of the distribution. 

Fig. 3. Complementary cumulative distribution (a) for Australia in 1999-2000
and (b) for Italy in 2000, with power-law fits using the estimated optimal
values for $\alpha$.